A Simple Probabilistic Algorithm for Detecting Community Structure in Social Networks

Wei Ren, Guiying Yan, Xiaoping Liao, and Lan Xiao
Academy of Mathematics and Systems Science
Chinese Academy of Sciences

With the growing number of available social and biological networks, detecting community structure has become an increasingly important first step in analyzing such data. Community structure is generally understood to mean that nodes in the same community share more edges with each other than with nodes in different communities. We propose SPAEM, a Simple Probabilistic Algorithm for detecting community structure that employs the Expectation-Maximization (EM) algorithm. We also give a criterion based on minimum description length for identifying the optimal number of communities. SPAEM can detect overlapping nodes and handle weighted networks. Tests on simulated data and several widely known data sets show that it is powerful and effective.
I. INTRODUCTION



Many systems can be represented as networks in which nodes denote entities and links denote relations between them. Such systems include the web [1], biological networks [2], ecological webs [3], and social organization networks [4]. Many interesting properties have been identified in these networks, such as the small-world property [5] and power-law degree distributions [6]. One property that attracts much attention is community structure: nodes within the same community are more densely connected than nodes in different communities [7]. Identifying it gives a better understanding of how the network is organized.

This problem has been studied from different perspectives. Earlier approaches to identifying communities fall into two categories: the hierarchical approach and the divisive approach. The former recursively merged the two closest nodes into one community until the whole network became a single community; the latter worked top-down, recursively splitting the network into two communities until every node was its own community. These algorithms usually needed a measure of the closeness or dissimilarity between two nodes; see [?].

An important measure of the quality of a community structure, modularity, was proposed by Newman [12], and several algorithms work by maximizing it [?]. This measure characterizes community structure efficiently in networks with balanced structure; however, the internal scale problem in its definition [17] makes it fail on unbalanced networks, such as those whose communities vary in size and degree sequence. Quite recently, an information-based algorithm by Martin [18] was shown to resolve communities accurately and, in particular, to overcome the scale problem of modularity to some extent.

Researchers [?] also found that communities overlap rather than being disjoint, and subsequent algorithms [?], [20], [21] were designed to deal with overlapping communities. A mixture model by Newman [22] can automatically detect patterns inside a network and, as a byproduct, detect overlapping nodes.

These state-of-the-art algorithms motivate us to treat community detection as a probabilistic inference problem: we should mine the hidden information that determines the network topology, since this information gives insight into the network structure. Our work is inspired by probabilistic latent semantic analysis [?], a powerful algorithm in text mining, which models a term as occurring in a document when both fall under the same latent topic. We employ this idea to detect community structure in complex networks.



II. METHOD 

Assume that the network is undirected and unweighted with $n$ nodes. Let $A$ denote the adjacency matrix and $N(i)$ the set of neighbors of node $i$. We assume that two nodes $i$ and $j$ share an edge because they belong to the same community, which is hidden information; see FIG. 1. Under this model, a community membership $g_{ij}$ can be assigned to each edge $e_{ij}$, such that $g_{ij} = r$ if and only if both end nodes of edge $e_{ij}$ belong to community $r$.

Suppose $c$ communities are to be detected. Let $\pi_r$ be the probability of community $r$, which can be viewed as the fraction of nodes in community $r$, $r = 1, 2, \ldots, c$. The conditional probability $\Pr(i \mid r)$ of node $i$ appearing in community $r$, denoted by $\beta_{r,i}$, satisfies $\sum_{i=1}^{n} \beta_{r,i} = 1$. In fact, $\beta_{r,i}$ can be viewed as the importance of node $i$ in community $r$: the larger $\beta_{r,i}$ is, the more important node $i$ is in community $r$. Let $\pi$ and $\beta$ denote the parameter sets $\{\pi_r, r = 1, 2, \ldots, c\}$ and $\{\beta_{r,i}, r = 1, 2, \ldots, c, i = 1, 2, \ldots, n\}$, respectively.

FIG. 1: An observed edge exists between nodes $i$ and $j$ because they participate in the same community, which is hidden (unobserved) information. Our model mines this hidden information so that it generates the network with the highest probability.

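Naturally, the probability of edge $e_{ij}$ existing and belonging to community $r$ is modeled as

$$\Pr(e_{ij}, g_{ij} = r \mid \pi, \beta) = \pi_r \beta_{r,i} \beta_{r,j}.$$

In fact, $\Pr(e_{ij}, g_{ij} = r \mid \pi, \beta)$ can be viewed as the contribution of community $r$ to the formation of edge $e_{ij}$. The probability $\Pr(e_{ij} \mid \pi, \beta)$ that edge $e_{ij}$ exists is then the sum of the contributions from all communities $r = 1, 2, \ldots, c$, namely

$$\Pr(e_{ij} \mid \pi, \beta) = \sum_{r=1}^{c} \pi_r \beta_{r,i} \beta_{r,j}.$$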

The observed information is the edge $e_{ij}$, but it is determined by the unobserved parameters $\pi, \beta$. These parameters therefore determine the network topology $A$, which is exactly the idea depicted in FIG. 1.

Next, the log probability of network $A$ under parameters $\pi, \beta$ is modeled as

$$LL = \log \Pr(A \mid \pi, \beta) = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \log \Pr(e_{ij} \mid \pi, \beta) = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \log\Big(\sum_{r=1}^{c} \pi_r \beta_{r,i} \beta_{r,j}\Big) \qquad (1)$$

The parameters $\pi, \beta$ should be estimated to maximize Eq. (1). However, $LL$ in Eq. (1) contains the logarithm of a sum, which is difficult to optimize directly but can be handled easily by the Expectation-Maximization algorithm.
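As a concrete reading of Eq. (1), the following minimal sketch evaluates $LL$ with NumPy for a given adjacency matrix. The function name `log_likelihood` and the array layout (`pi` of shape `(c,)`, `beta` of shape `(c, n)`) are illustrative assumptions, not part of the original description, and the toy network is hypothetical.

```python
import numpy as np

def log_likelihood(A, pi, beta):
    """Evaluate Eq. (1) for adjacency matrix A (n x n), pi (c,), beta (c, n)."""
    # P[i, j] = sum_r pi[r] * beta[r, i] * beta[r, j]
    P = np.einsum("r,ri,rj->ij", pi, beta, beta)
    # Sum over ordered pairs (i, j) with j in N(i); each edge is counted twice.
    return float(np.log(P[A > 0]).sum())

# Toy example (hypothetical data): a triangle described by a single community.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
pi = np.array([1.0])
beta = np.full((1, 3), 1.0 / 3.0)
ll = log_likelihood(A, pi, beta)
print(ll)  # equals 6 * log(1/9): six ordered pairs, each with probability 1/9
```

Here every ordered pair has probability $1 \cdot \tfrac{1}{3} \cdot \tfrac{1}{3}$, so the value can be checked by hand.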



A. The EM Formula 

The EM algorithm was proposed to maximize likelihoods that contain latent variables [?]. In the E-step it computes the posterior probability of the latent variables given the observed data and the current parameter estimates; in the M-step it updates the parameters using these posterior probabilities. Denote by $q_{ij,r}$ the posterior probability $\Pr(g_{ij} = r \mid A, \pi, \beta)$ of edge $e_{ij}$ belonging to community $r$ given the observed network $A$ and parameters $\pi, \beta$. Then

$$q_{ij,r} = \Pr(g_{ij} = r \mid A, \pi, \beta) = \frac{\Pr(g_{ij} = r, A \mid \pi, \beta)}{\Pr(A \mid \pi, \beta)} = \frac{\Pr(e_{ij}, g_{ij} = r, A \mid \pi, \beta)}{\Pr(A \mid \pi, \beta)},$$

and by a simple deduction the E-step formula is obtained:

$$q_{ij,r} = \Pr(g_{ij} = r \mid A, \pi, \beta) = \frac{\pi_r \beta_{r,i} \beta_{r,j}}{\sum_{s=1}^{c} \pi_s \beta_{s,i} \beta_{s,j}} \qquad (2)$$

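In fact, $q_{ij,r}$ is the fraction of the contribution from community $r$ given the observed matrix $A$ and parameters $\pi, \beta$. The expected log-probability of the network is then

$$LL = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \sum_{r=1}^{c} q_{ij,r} \ln \Pr(e_{ij}, g_{ij} = r \mid \pi, \beta) = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \sum_{r=1}^{c} q_{ij,r} \ln(\pi_r \beta_{r,i} \beta_{r,j}) \qquad (3)$$

Combining this with the constraints $\sum_{r=1}^{c} \pi_r = 1$ and $\sum_{i=1}^{n} \beta_{r,i} = 1$, $r = 1, 2, \ldots, c$, the Lagrange form of $LL$ is

$$L = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \sum_{r=1}^{c} q_{ij,r} \ln(\pi_r \beta_{r,i} \beta_{r,j}) + \theta\Big(\sum_{r=1}^{c} \pi_r - 1\Big) + \sum_{r=1}^{c} \gamma_r\Big(\sum_{i=1}^{n} \beta_{r,i} - 1\Big) \qquad (4)$$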



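where $\theta$ and $\gamma_r$, $r = 1, 2, \ldots, c$, are Lagrange multipliers. The derivatives of $L$ in Eq. (4) are

$$\frac{\partial L}{\partial \pi_r} = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \frac{q_{ij,r}}{\pi_r} + \theta \qquad (5)$$

$$\frac{\partial L}{\partial \beta_{r,i}} = 2 \sum_{j: j \in N(i)} \frac{q_{ij,r}}{\beta_{r,i}} + \gamma_r \qquad (6)$$

The factor 2 in Eq. (6) arises because $\beta_{r,i}$ enters the double sum once through the ordered pair $(i, j)$ and once through $(j, i)$, with $q_{ij,r} = q_{ji,r}$; it cancels after normalization.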



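Setting the derivatives in Eq. (5) and Eq. (6) to zero and combining the constraints $\sum_{r=1}^{c} \pi_r = 1$ and $\sum_{i=1}^{n} \beta_{r,i} = 1$, $r = 1, 2, \ldots, c$, gives the M-step formulas:

$$\pi_r = \frac{\sum_{i} \sum_{j: j \in N(i)} q_{ij,r}}{\sum_{i} \sum_{j: j \in N(i)} \sum_{s=1}^{c} q_{ij,s}} \qquad (7)$$

$$\beta_{r,i} = \frac{\sum_{j: j \in N(i)} q_{ij,r}}{\sum_{k=1}^{n} \sum_{j: j \in N(k)} q_{kj,r}} \qquad (8)$$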


In the E-step, the membership of an edge is influenced by its end nodes, while in the M-step the importance of a node in each community is influenced by the memberships of all its links. Iterating the E-step and M-step never decreases $LL$ in Eq. (1).
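The E-step and M-step can be sketched in a few lines of NumPy. This is a minimal dense-matrix illustration under stated assumptions, not the authors' implementation: the names `em_step` and `log_likelihood`, the array shapes, the guard against empty communities, and the toy network are all hypothetical, and a dense $c \times n \times n$ array is used for clarity rather than efficiency.

```python
import numpy as np

def log_likelihood(A, pi, beta):
    """LL of Eq. (1); every edge is visited twice, as (i, j) and (j, i)."""
    P = np.einsum("r,ri,rj->ij", pi, beta, beta)
    return float(np.log(P[A > 0]).sum())

def em_step(A, pi, beta):
    """One E-step (Eq. (2)) followed by one M-step (Eqs. (7) and (8))."""
    edge = (A > 0)
    # joint[r, i, j] = pi_r * beta_{r,i} * beta_{r,j} on edges, 0 elsewhere
    joint = pi[:, None, None] * beta[:, :, None] * beta[:, None, :] * edge
    denom = joint.sum(axis=0)
    denom[denom == 0] = 1.0                  # non-edges: keep q = 0, avoid 0/0
    q = joint / denom                        # E-step: q[r, i, j] = q_{ij,r}
    mass = q.sum(axis=(1, 2))                # sum_i sum_{j in N(i)} q_{ij,r}
    mass = np.maximum(mass, np.finfo(float).tiny)  # guard: empty community
    new_pi = mass / mass.sum()               # Eq. (7)
    new_beta = q.sum(axis=2) / mass[:, None] # Eq. (8)
    return new_pi, new_beta

# Toy network (hypothetical): two triangles joined by one bridge edge.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

rng = np.random.default_rng(0)
pi = rng.random(2)
pi /= pi.sum()
beta = rng.random((2, 6))
beta /= beta.sum(axis=1, keepdims=True)

history = [log_likelihood(A, pi, beta)]
for _ in range(50):
    pi, beta = em_step(A, pi, beta)
    history.append(log_likelihood(A, pi, beta))
```

Recording `history` makes it easy to confirm on a small example that $LL$ does not decrease from one iteration to the next.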

Once all the parameters are estimated, the preference of node $i$ for community $s$ is computed as $u_{s,i} = \pi_s \beta_{s,i}$, and node $i$ is assigned to the community $r$ such that $r = \arg\max_s \{u_{s,i} = \pi_s \beta_{s,i}, s = 1, 2, \ldots, c\}$. The $u_{s,i}$ can be normalized so that they sum to 1 over $s$, in keeping with the probability normalization condition. This gives a soft assignment, which can be used to detect overlapping nodes. Suppose that for node $i$, $r = \arg\max_s \{u_{s,i} = \pi_s \beta_{s,i}, s = 1, 2, \ldots, c\}$; empirically, node $i$ is an overlapping node if there is another community $s$ such that $u_{s,i}/u_{r,i} > 1/10$.
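The assignment rule can be illustrated with a short sketch. It assumes that the empirical 1/10 threshold is applied to the ratio $u_{s,i}/u_{r,i}$ between another community's preference and the largest one; the function name `assign_nodes` and its `ratio` parameter are hypothetical. The input values are the $u_{1,i}$ and $u_{2,i}$ reported in Table I for nodes 1, 9, and 33.

```python
import numpy as np

def assign_nodes(u, ratio=0.1):
    """u[s, i] is the preference of node i for community s."""
    u = np.asarray(u, dtype=float)
    hard = u.argmax(axis=0)                    # community with the largest u
    soft = u / u.sum(axis=0, keepdims=True)    # normalized preferences
    best = u.max(axis=0)
    # Overlapping if a community other than the best one has u_s / u_r > ratio
    # (assumed form of the threshold; the best community always counts once).
    overlapping = (u > ratio * best).sum(axis=0) > 1
    return hard, soft, overlapping

# Preferences of nodes 1, 9 and 33 from Table I (rows: communities 1 and 2).
u = np.array([[3.30e-05, 0.0219, 0.0769],
              [0.1025,   0.0101, 1.55e-08]])
hard, soft, overlapping = assign_nodes(u)
print(hard)         # [1 0 0]  (0-based community indices)
print(overlapping)  # [False  True False]
```

Under this reading, node 9 is flagged as overlapping while nodes 1 and 33 are not, in line with the discussion of the Zachary network.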
The parameters $\pi, \beta$ are initialized with random values and updated by alternating the E-step and M-step until $LL$ stabilizes. To keep the algorithm from getting stuck in a local maximum, we adopt a restart strategy that runs the EM algorithm several times with different initial parameter values.

Suppose the network has $l$ edges in total. Each EM iteration then takes $O(cl)$ time, linear in the number of edges, which makes the algorithm appealing for detecting communities in large-scale networks. Note that the actual running time also depends on the number of EM iterations and the number of restarts. For ease of reference we name our model SPAEM, short for Simple Probabilistic Algorithm that employs the Expectation-Maximization framework.
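The random initialization and restart strategy might look as follows. This is a sketch under assumptions: the helper names (`fit_once`, `fit_with_restarts`), the stopping tolerance, the iteration cap, and the number of restarts are illustrative choices that the text does not specify, and the dense arrays used here do not achieve the $O(cl)$ cost per iteration.

```python
import numpy as np

def log_likelihood(A, pi, beta):
    P = np.einsum("r,ri,rj->ij", pi, beta, beta)
    return float(np.log(P[A > 0]).sum())

def em_step(A, pi, beta):
    joint = pi[:, None, None] * beta[:, :, None] * beta[:, None, :] * (A > 0)
    denom = joint.sum(axis=0)
    denom[denom == 0] = 1.0                        # non-edges keep q = 0
    q = joint / denom                              # E-step
    mass = np.maximum(q.sum(axis=(1, 2)), np.finfo(float).tiny)
    return mass / mass.sum(), q.sum(axis=2) / mass[:, None]   # M-step

def fit_once(A, c, rng, max_iter=500, tol=1e-8):
    """Run EM from one random initialization until LL stabilizes."""
    n = A.shape[0]
    pi = rng.random(c)
    pi /= pi.sum()
    beta = rng.random((c, n))
    beta /= beta.sum(axis=1, keepdims=True)
    ll = log_likelihood(A, pi, beta)
    for _ in range(max_iter):
        pi, beta = em_step(A, pi, beta)
        new_ll = log_likelihood(A, pi, beta)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return pi, beta, ll

def fit_with_restarts(A, c, n_restarts=10, seed=0):
    """Keep the run with the highest LL among several random restarts."""
    rng = np.random.default_rng(seed)
    runs = [fit_once(A, c, rng) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[2]), [run[2] for run in runs]

# Toy network (hypothetical): two triangles joined by one bridge edge.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

(pi, beta, best_ll), all_ll = fit_with_restarts(A, c=2)
```

Comparing `all_ll` across restarts gives a rough sense of how often EM lands in a poorer local maximum on a given network.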




FIG. 2: Zachary club network. Node color indicates community and node size indicates $u_{r,i}$. Nodes 9, 10, and 31 are overlapping nodes and have been identified by our algorithm.



B. Model Selection Issue 

SPAEM needs a pre-specified number of communities $c$, which is regarded as prior knowledge. However, determining $c$ is a non-trivial task and is difficult when no prior knowledge is available. We handle it using the Minimum Description Length principle.
In general, $LL$ in Eq. (1) increases as $c$ increases; meanwhile, an extra cost has to be paid for the increase in the number of parameters, $K = (c - 1) + c(n - 1)$. There should be some balance between $LL$ and $K$, and the minimum description length principle can be employed here. According to this principle, the code length needed to describe the network data consists of two parts: the first is the length needed to code the network using SPAEM, and the second is the length needed to code the parameters of SPAEM itself. The length needed to code the network using SPAEM is $-LL/2$ (note that every edge is counted twice in $LL$). To code the parameters, a precision $\epsilon$ has to be pre-specified. With this precision, parameters smaller than $\epsilon$ are not coded and get a description length of 0; otherwise, coding the parameter $\pi_r$ needs length $\log(\pi_r/\epsilon)$ and coding $\beta_{r,i}$ needs length $\log(\beta_{r,i}/\epsilon)$. The total length $H$ for coding the model is therefore

$$H = -LL/2 + \sum_{r=1}^{c} \log\Big(\frac{\pi_r}{\epsilon}\Big) I(\pi_r > \epsilon) + \sum_{r=1}^{c} \sum_{i=1}^{n} \log\Big(\frac{\beta_{r,i}}{\epsilon}\Big) I(\beta_{r,i} > \epsilon) \qquad (9)$$

where $I(\cdot)$ is the indicator function.

The value of $c$ should be chosen as the one that yields the minimum description length $H$ in Eq. (9). Choosing the precision $\epsilon$ in Eq. (9) is tricky but very important. A smaller $\epsilon$ leads to longer codes for the parameters and hence favors models with small $c$. In fact, it has been shown that networks are organized hierarchically [26], and the choice of $\epsilon$ gives a lever for viewing networks at different resolutions. Intuitively, $\epsilon$ should be on the scale of $1/n$ because of the normalization condition $\sum_{i=1}^{n} \beta_{r,i} = 1$. Typically, if node $i$ belongs to community $r$, $\beta_{r,i}$ will be on the scale of $1/n$, and it will be much smaller than $1/n$ otherwise. Here $\epsilon$ is set to $1/(3n)$. This precision is entirely empirical but, as shown in the next section, for well-clustered networks the model selection results are robust to the choice of $\epsilon$ in the range from $1/n$ to $1/(7n)$.
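The description length $H$ of Eq. (9) is straightforward to evaluate once $\pi$ and $\beta$ have been fitted. The sketch below assumes natural logarithms throughout and the default $\epsilon = 1/(3n)$; the function name `description_length` and the toy inputs are hypothetical.

```python
import numpy as np

def description_length(A, pi, beta, eps=None):
    """Evaluate H of Eq. (9) for fitted parameters pi (c,) and beta (c, n)."""
    n = A.shape[0]
    if eps is None:
        eps = 1.0 / (3 * n)                  # empirical precision from the text
    P = np.einsum("r,ri,rj->ij", pi, beta, beta)
    LL = np.log(P[A > 0]).sum()              # Eq. (1), each edge counted twice

    def code_length(x):
        # Parameters not larger than eps are not coded (length 0).
        x = np.asarray(x, dtype=float)
        return np.log(x[x > eps] / eps).sum()

    return float(-LL / 2 + code_length(pi) + code_length(beta))

# Toy example (hypothetical): a triangle with one community, uniform beta.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
pi = np.array([1.0])
beta = np.full((1, 3), 1.0 / 3.0)
H = description_length(A, pi, beta)
print(H)  # 11 * log(3): 6*log(3) for the network, 2*log(3) + 3*log(3) for parameters
```

To select $c$, one would fit SPAEM for each candidate $c$ and keep the value with the smallest $H$.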



III. EXPERIMENT 
A. Zachary Club Network 

The famous Zachary club network records the acquaintance relationships among 34 members [4]. The club split into two parts because of an internal dispute, so it naturally has community structure. Setting $c = 2$, we run our algorithm and recover exactly the original two communities; see FIG. 2. Node color indicates community and node size indicates the value of $u_{r,i} = \pi_r \beta_{r,i}$, which partially measures the importance of node $i$ in community $r$. Nodes 1, 2, 33, and 34 are the important nodes found by SPAEM, which can be verified intuitively from the network.

SPAEM gives a soft assignment to each node and so is capable of detecting overlapping nodes; see Table I. To compare the ability to detect overlapping nodes, we also include $q_{ir}$, which is used to assign communities in the mixture model [22]. Clearly, nodes 1, 2, 33, and 34 are not overlapping nodes, but node 9 is. The mixture model can also detect this; however, comparing the corresponding probabilities in Table I, SPAEM reveals the extent of overlap more accurately.



B. American College Football Team Network 

The second network investigated is the college football network, which represents the game schedule of the 2000 season of Division I of the US college football league [7]. The nodes represent the 115 teams, and the links represent the 613 games played. The teams are divided into 12 conferences, and games are generally more frequent between members of the same conference than between teams of different conferences.

TABLE I: Results on the Zachary network. $u_{r,i}$ is calculated as $u_{r,i} = \pi_r \beta_{r,i}$ and is interpreted as the preference of node $i$ for community $r$. The $q_{ir}$ of the mixture model [22] are also included; to facilitate comparison, we normalize $u_{s,i}$ so that they add up to 1.

Node ID   u_{1,i}    u_{2,i}    Normalized u_{1,i}   q_{i1} (a)
1         3.30E-05   0.1025     0.00                 0.00
2         4.86E-06   0.0577     0.00                 0.00
9         0.0219     0.0101     0.69                 0.96
13        5.83E-36   0.0128     0.00                 0.00
31        0.0179     0.0078     0.70                 0.92
33        0.0769     1.55E-08   1.00                 1.00
34        0.1090     8.20E-06   1.00                 1.00

(a) $q_{ir}$ is defined in [22] as the probability of node $i$ belonging to community $r$.

FIG. 3: Result of SPAEM for the American football network. Node label indicates real community membership. Nodes belonging to the same community detected by SPAEM are placed adjacently.

The results of SPAEM and the mixture model [22] are depicted in FIG. 3 and FIG. 4, respectively. SPAEM largely uncovers the original community structure. The mixture model, however, gets a very different result; see FIG. 4. This is because a group it detects is a set of nodes with similar linkage patterns and so may not be a community in the common sense. The 3-node group in the middle of FIG. 4 is obviously not a community. There are also other groups consisting of nodes from different communities; see FIG. 4. The mixture model can detect patterns, but it cannot differentiate between kinds of patterns; in other words, it cannot tell whether a detected group is a community.

FIG. 4: Result of the mixture model [22] for the American football network. Node label indicates real community membership. Nodes belonging to the same detected group are placed together; groups that do not match the common-sense community structure are circled. Some of these groups are formed by nodes from two real communities. There is also a 3-node group that is clearly not a community.



C. Comparison With Other Methods 



A modularity measure

$$Q = \sum_{r=1}^{c} \Big( \frac{l_r}{l} - \Big( \frac{d_r}{2l} \Big)^2 \Big)$$

was proposed by Newman [12], where $l_r$ is the number of links in community $r$, $d_r$ is the total degree in community $r$, and $l$ is the total number of edges in the network. Good community structure usually corresponds to a large value of $Q$. However, the definition of $Q$ contains the scale $l$, and this may cause problems in some networks [17], [18], such as those whose communities vary in size and degree sequence.
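For reference, the modularity $Q$ of a given partition can be computed directly from its definition. The sketch below assumes an undirected, unweighted adjacency matrix; the function name `modularity` and the toy network are hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Q = sum_r [ l_r / l - (d_r / (2 l))^2 ] for a hard partition."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    l = A.sum() / 2                                   # total number of edges
    Q = 0.0
    for r in np.unique(labels):
        members = labels == r
        l_r = A[np.ix_(members, members)].sum() / 2   # edges inside community r
        d_r = A[members].sum()                        # total degree of community r
        Q += l_r / l - (d_r / (2 * l)) ** 2
    return Q

# Toy network (hypothetical): two triangles joined by one bridge edge.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

Q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # 2 * (3/7 - (7/14)^2) = 5/14
Q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # 1 - 1 = 0
```

The total edge count $l$ appears in both terms, which is the scale dependence referred to in the text.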

The dolphin social network reported by Lusseau [?] provides a natural example in which communities vary in size. In this network, two dolphins are linked if they are observed together more often than expected by chance. The two original communities have different sizes, one containing 22 dolphins and the other 40. Setting $c = 2$, SPAEM misclassifies only one node and gets exactly the same result as the GN algorithm [?] and the information-based algorithm [18]; the modularity-based method [?], however, gets a different result, as depicted in FIG. 5.

FIG. 5: Dolphin network. Node shape denotes the real split. The grey line shows the result of SPAEM, with only 1 mistake; the left and right black lines indicate the results of the algorithms in [?].

It has been shown that the modularity algorithm works well for networks whose communities have roughly the same size and degree sequence, but may not give very competitive results when the communities differ in size and degree sequence [?]. To show how SPAEM handles these situations, we conduct the same three sets of tests as in [18]: symmetric, node asymmetric, and link asymmetric. In the symmetric test, each network is composed of 4 communities with 32 nodes each; each node has an average degree of 16, and $k_{out}$ is the average number of edges linking a node to nodes in other communities. In the node asymmetric test, each network is composed of 2 communities with 96 and 32 nodes, respectively, and $k_{out}$ has the same meaning as in the symmetric test. $k_{out}$ is set to 6, 7, 8 in both the symmetric and node asymmetric cases; as $k_{out}$ increases, it becomes harder to detect the real community structure. In the link asymmetric test, 2 communities with 64 nodes each differ in their average degree: nodes in one community have on average 24 edges and nodes in the other only 8, with $k_{out} = 2, 3, 4$.

Table II compares the results of our algorithm with those of other algorithms [12], [18], [22]. Note that the results of the information algorithm and the modularity algorithm are cited from [?], while the results of the mixture model were calculated by the authors. We have to admit that the information algorithm performs best overall, especially in the node asymmetric and link asymmetric tests. SPAEM outperforms the modularity algorithm [12] in the node asymmetric test and, at $k_{out} = 4$, in the link asymmetric test, whereas in the symmetric test the modularity algorithm is slightly ahead at $k_{out} = 7, 8$ (Table II). The mixture model [22] performs less well in the symmetric test; this may be because the groups it discovers are not necessarily communities, given the fuzzy structure of these networks as $k_{out}$ increases.
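A network of the kind used in the symmetric test can be generated as sketched below. The text specifies only the community sizes, the average degree, and $k_{out}$; placing each edge independently with probabilities chosen to match those averages is an assumption of this sketch, and the function name `symmetric_benchmark` is hypothetical.

```python
import numpy as np

def symmetric_benchmark(k_out, n_comm=4, size=32, k_avg=16, seed=0):
    """Random network with n_comm equal communities and average degree k_avg."""
    rng = np.random.default_rng(seed)
    n = n_comm * size
    labels = np.repeat(np.arange(n_comm), size)
    # Assumed construction: independent edges with probabilities that give
    # k_avg - k_out expected internal and k_out expected external links per node.
    p_in = (k_avg - k_out) / (size - 1)
    p_out = k_out / (n - size)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each pair once
    A = (upper | upper.T).astype(int)                  # symmetrize, no self-loops
    return A, labels

A, labels = symmetric_benchmark(k_out=6)
mean_degree = A.sum(axis=1).mean()
different = labels[:, None] != labels[None, :]
mean_k_out = (A * different).sum(axis=1).mean()
```

The node asymmetric and link asymmetric networks could be produced the same way by changing the community sizes or using different internal probabilities per community.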

D. Handling Weighted Network 

SPAEM can also be extended to handle weighted networks. Suppose the weighted adjacency matrix of the network is $W_{n \times n}$ with entries $w_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$. The log-likelihood of the network then becomes

$$LL = \sum_{i=1}^{n} \sum_{j: j \in N(i)} w_{ij} \log\Big(\sum_{r} \pi_r \beta_{r,i} \beta_{r,j}\Big) \qquad (10)$$



TABLE II: Results of the benchmark tests for the three experiments: symmetric, node asymmetric, and link asymmetric.

Test              kout   SPAEM   Compression (a)   Modularity (b)   Mixture (c)
Symmetric         6      0.99    0.99              0.99             0.92
                  7      0.95    0.97              0.97             0.81
                  8      0.84    0.87              0.89             0.64
Node Asymmetric   6      0.97    0.99              0.85             0.97
                  7      0.92    0.96              0.80             0.92
                  8      0.79    0.82              0.74             0.74
Link Asymmetric   2      0.98    1.00              1.00             0.99
                  3      0.94    1.00              0.96             0.94
                  4      0.84    1.00              0.74             0.70

(a) Information method in [18].
(b) Modularity-based method in [12].
(c) Mixture model in [22].



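The expected log-likelihood $LL$ becomes

$$LL = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \sum_{r} w_{ij} q_{ij,r} \ln \Pr(e_{ij}, g_{ij} = r \mid \pi, \beta) = \sum_{i=1}^{n} \sum_{j: j \in N(i)} \sum_{r} w_{ij} q_{ij,r} \ln(\pi_r \beta_{r,i} \beta_{r,j}) \qquad (11)$$

The E-step is unchanged, but the M-step becomes

$$\pi_r = \frac{\sum_{i} \sum_{j: j \in N(i)} w_{ij} q_{ij,r}}{\sum_{i} \sum_{j: j \in N(i)} \sum_{s} w_{ij} q_{ij,s}}, \qquad \beta_{r,i} = \frac{\sum_{j: j \in N(i)} w_{ij} q_{ij,r}}{\sum_{k=1}^{n} \sum_{j: j \in N(k)} w_{kj} q_{kj,r}}.$$

Intuitively, the M-step formula is reasonable, since links with greater weights contribute more to the corresponding parameters.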
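The weighted update differs from the unweighted one only in that each posterior is multiplied by its edge weight before the sums are taken. A minimal sketch, assuming a symmetric non-negative weight matrix in which zero means "no edge"; the function name `weighted_em_step` and the toy weights are hypothetical.

```python
import numpy as np

def weighted_em_step(W, pi, beta):
    """E-step as in the unweighted case, M-step with weights w_ij."""
    W = np.asarray(W, dtype=float)
    joint = pi[:, None, None] * beta[:, :, None] * beta[:, None, :] * (W > 0)
    denom = joint.sum(axis=0)
    denom[denom == 0] = 1.0                   # non-edges keep q = 0
    q = joint / denom                         # unchanged E-step
    wq = W[None, :, :] * q                    # w_ij * q_{ij,r}
    mass = np.maximum(wq.sum(axis=(1, 2)), np.finfo(float).tiny)
    new_pi = mass / mass.sum()
    new_beta = wq.sum(axis=2) / mass[:, None]
    return new_pi, new_beta

# Toy weighted network (hypothetical): two triangles with internal weight 2,
# joined by a bridge edge of weight 1.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 2), (0, 2, 2), (1, 2, 2),
                (3, 4, 2), (3, 5, 2), (4, 5, 2), (2, 3, 1)]:
    W[i, j] = W[j, i] = w

rng = np.random.default_rng(0)
pi0 = rng.random(2)
pi0 /= pi0.sum()
beta0 = rng.random((2, 6))
beta0 /= beta0.sum(axis=1, keepdims=True)

pi1, beta1 = weighted_em_step(W, pi0, beta0)
pi2, beta2 = weighted_em_step(3 * W, pi0, beta0)   # same update: only ratios matter
```

Because both numerator and denominator are linear in the weights, rescaling all weights by a constant leaves the update unchanged, while raising only some weights shifts the parameters toward the corresponding links.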

To test SPAEM on weighted networks, a simulation test is done as in [13]. This set of tests is based on the symmetric test above with $k_{out} = 8$: for each of the 100 networks in the symmetric test with $k_{out} = 8$, the weight of edges within a community is raised to $w = 1.4, 1.6, 1.8, 2$, while the weight of edges running between communities is left unchanged (weight 1). As $w$ increases from 1.4 to 2, models should become better at detecting the community structure. The results of SPAEM are shown in Table III, together with the results in [23] for comparison (note that the results in [23] are cited directly rather than recalculated). SPAEM generally outperforms the model in [23].

TABLE III: Benchmark test on the weighted networks designed by [?]. There are 4 communities, each with 32 nodes, in the network, with $k_{out} = 8$. As $w$ increases from 1.4 to 2, both methods respond positively, but SPAEM gets better results.

          SPAEM   Markov (a)
w = 1.4   0.96    0.89
w = 1.6   0.98    0.94
w = 1.8   0.99    0.97
w = 2     0.99    0.98

(a) Random walk model in [27].

FIG. 6: Results on the simulated weighted network. Node shape shows the original community, while node color indicates the community structure detected by SPAEM.

FIG. 7: (a) Model selection result for the American football team network (panel title: "Football network description length"). (b) Model selection result for the journal citation network (panel title: "PCBE network description length").

The limitation of the above simulation test is that any algorithm will respond positively when $w$ increases, and that the original unweighted networks already have clear community structure. We therefore devise a more elaborate example. Consider a network with 32 nodes in which each node pair has an edge with probability $P_{rand}$; clearly, this network has no community structure. Let nodes 1 to 16 be in group 1 and nodes 17 to 32 in group 2. The weight of each edge inside a group is raised to 1.5 with probability $P_{weight}$, while the weight of edges running between groups is unchanged. Now the only thing that can differentiate the two groups is the edge weights. Setting $P_{rand} = 0.8$ and $P_{weight} = 0.8$, SPAEM uncovers the two groups with only 3 mistakes; see FIG. 6. This shows that SPAEM is able to make good use of edge weights.
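The 32-node weighted example can be reproduced along the following lines. This is a sketch of the construction as described in the text; the function name `weighted_random_network`, its parameter names, and the random seed are hypothetical, so an individual realization will differ from the one shown in FIG. 6.

```python
import numpy as np

def weighted_random_network(n=32, p_rand=0.8, p_weight=0.8, w=1.5, seed=0):
    """Random graph whose two groups differ only in their edge weights."""
    rng = np.random.default_rng(seed)
    groups = np.repeat([0, 1], n // 2)               # first half, second half
    # Each node pair gets an edge with probability p_rand (upper triangle only).
    upper = np.triu(rng.random((n, n)) < p_rand, k=1)
    same = groups[:, None] == groups[None, :]
    # Within-group edges have their weight raised to w with probability p_weight.
    raised = (rng.random((n, n)) < p_weight) & same & upper
    W_upper = upper.astype(float)                    # default weight 1
    W_upper[raised] = w
    W = W_upper + W_upper.T                          # symmetric weight matrix
    return W, groups

W, groups = weighted_random_network()
between = groups[:, None] != groups[None, :]
```

The topology alone carries no group information here, so any recovery of the two groups has to come from the weights.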



E. Model Selection Test 

The minimum description length $H$ defined in Eq. (9) is now employed for SPAEM to select $c$, the optimal number of communities, with the precision empirically set to $1/(3n)$. The criterion indicates that 11 communities should be detected in the American football network [7]; see FIG. 7(a). This result seems wrong, since there should be 12 communities; however, one of the conferences, "Independents", cannot be regarded as a real conference because its teams play games with adjacent conferences. The criterion also finds 4 communities in the journal citation network; see FIG. 7(b). These two results show that $H$ in Eq. (9) and the precision $1/(3n)$ are sound.

To further test the validity of the model selection principle, model selection results for the simulation experiments above (symmetric, node asymmetric, link asymmetric) are presented in Table IV. Combined with the model selection principle, SPAEM gives very competitive results in all three tests. One odd observation is that in the node asymmetric case, the accuracy of SPAEM increases as $k_{out}$ increases. This is partly because the penalty term for describing the model parameters in Eq. (9) favors a small number of communities, which in turn supports the view that the selection criterion and the precision are reasonable.

TABLE IV: Model selection results. Each entry is the fraction of networks identified with the correct number of communities; the number in parentheses is the average number of communities identified by the corresponding algorithm.

Test              kout   SPAEM-MDL     Information (a)   Modularity (b)
Symmetric         6      1.00 (4.00)   1.00 (4.00)       1.00 (4.00)
                  7      1.00 (4.00)   1.00 (4.00)       1.00 (4.00)
                  8      0.65 (3.60)   0.14 (1.93)       0.70 (4.33)
Node Asymmetric   6      0.82 (2.18)   1.00 (2.00)       0.00 (4.95)
                  7      0.83 (2.17)   0.80 (1.80)       0.00 (4.97)
                  8      0.93 (2.07)   0.06 (1.06)       0.00 (5.29)
Link Asymmetric   2      1.00 (2.00)   1.00 (2.00)       0.00 (3.10)
                  3      1.00 (2.00)   1.00 (2.00)       0.00 (4.48)
                  4      1.00 (2.00)   1.00 (2.00)       0.00 (5.55)

(a) Information method [18].
(b) Modularity method [13].



F. Model Selection Discussion 

The model selection criterion in Eq. (9) is sensitive to the choice of the accuracy $\epsilon$: different values of $\epsilon$ lead to different model selection results. Intuitively, a small $\epsilon$ favors a smaller number of communities and a large $\epsilon$ tends to identify a larger number of communities. In fact, it has been shown that complex networks may be organized hierarchically, which allows us to view them at different resolutions [26]. The accuracy $\epsilon$ thus provides the capacity to detect communities at different resolutions.

However, for networks with well-defined community structure, the model selection criterion is expected to be robust to the choice of $\epsilon$. To verify this, values of $\epsilon$ ranging from $1/n$ to $1/(7n)$ are tested on the journal citation network [18]. The criterion identifies 4 communities for $\epsilon$ ranging from $1/n$ to $1/(6n)$ and 3 communities for $\epsilon = 1/(7n)$, strongly indicating that this network actually has 4 communities. We further test how different values of $\epsilon$ affect the model selection result using the symmetric test with $k_{out} = 6, 7, 8$. For $\epsilon$ ranging from $1/(2n)$ to $1/(7n)$, the criterion nearly always identifies the correct number of communities when $k_{out} = 6, 7$; when $k_{out} = 8$, however, the accuracy drops drastically because the structure becomes fuzzy when too many edges link to other communities. These results show that the model selection criterion for SPAEM is indeed robust to the choice of $\epsilon$ for well-clustered networks.

TABLE V: Summary of the features of SPAEM and the mixture model [22].

                   SPAEM   Mixture
Time Cost          O(cl)   O(cl)
Model Selection?   Yes     No
Weighted Graph?    Yes     No
Directed Graph?    No      Yes
Detect Pattern?    No      Yes



IV. CONCLUSION 

In this paper, we propose SPAEM, a probabilistic algorithm for resolving community structure. We have shown the power of SPAEM in detecting community structure as well as in providing additional useful information, such as node importance and overlapping nodes. SPAEM is also extended to handle weighted networks. To determine the optimal number of communities, the minimum description length principle is employed and tested on a variety of networks.

The mixture model in [22] is a good algorithm capable of detecting patterns and handling directed networks, while SPAEM focuses on detecting community structure. Experimentally, SPAEM performs better at uncovering community structure and identifying overlapping nodes. Although the two algorithms appear similar, they are based on different model assumptions. Table V summarizes the features of the two algorithms.